Misspecification in Inverse Reinforcement Learning
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function
$R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to
$R$. In the current literature, the most common models are optimality,
Boltzmann rationality, and causal entropy maximisation. One of the primary
motivations behind IRL is to infer human preferences from human behaviour.
However, the true relationship between human preferences and human behaviour is
much more complex than any of the models currently used in IRL. This means that
they are misspecified, which raises the worry that they might lead to unsound
inferences if applied to real-world data. In this paper, we provide a
mathematical analysis of how robust different IRL models are to
misspecification, and answer precisely how the demonstrator policy may differ
from each of the standard models before that model leads to faulty inferences
about the reward function $R$. We also introduce a framework for reasoning
about misspecification in IRL, together with formal tools that can be used to
easily derive the misspecification robustness of new IRL models.
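To make these behavioural models concrete, the Boltzmann-rational model mentioned above is standardly written (in illustrative notation, not taken from the paper) as

  $\pi(a \mid s) \propto \exp\big(\beta \, Q^*_R(s, a)\big),$

where $Q^*_R$ is the optimal Q-function for the reward function $R$ and $\beta > 0$ is an inverse-temperature (rationality) parameter; the optimality model corresponds to the limit $\beta \to \infty$. The misspecification question is then how far a real demonstrator's policy may deviate from such a model before inferences about $R$ become unreliable.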
Lexicographic Multi-Objective Reinforcement Learning
In this work we introduce reinforcement learning techniques for solving
lexicographic multi-objective problems. These are problems that involve
multiple reward signals, and where the goal is to learn a policy that maximises
the first reward signal, and subject to this constraint also maximises the
second reward signal, and so on. We present a family of both action-value and
policy gradient algorithms that can be used to solve such problems, and prove
that they converge to policies that are lexicographically optimal. We evaluate
the scalability and performance of these algorithms empirically, demonstrating
their practical applicability. As a more specific application, we show how our
algorithms can be used to impose safety constraints on the behaviour of an
agent, and compare their performance in this context with that of other
constrained reinforcement learning algorithms.
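A minimal formal sketch of the lexicographic objective, in illustrative notation rather than the paper's own: given reward signals $R_1, \dots, R_k$ with expected returns $J_1, \dots, J_k$, a policy is lexicographically optimal if it belongs to $\Pi_k$, where

  $\Pi_0 = \Pi$ (the set of all policies), and $\Pi_i = \arg\max_{\pi \in \Pi_{i-1}} J_i(\pi)$ for $i = 1, \dots, k$.

That is, each reward signal is maximised only over the policies that already maximise all higher-priority signals.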
Is SGD a Bayesian sampler? Well, almost
Overparameterised deep neural networks (DNNs) are highly expressive and so
can, in principle, generate almost any function that fits a training dataset
with zero error. The vast majority of these functions will perform poorly on
unseen data, and yet in practice DNNs often generalise remarkably well. This
success suggests that a trained DNN must have a strong inductive bias towards
functions with low generalisation error. Here we empirically investigate this
inductive bias by calculating, for a range of architectures and datasets, the
probability $P_{SGD}(f \mid S)$ that an overparameterised DNN, trained with
stochastic gradient descent (SGD) or one of its variants, converges on a
function $f$ consistent with a training set $S$. We also use Gaussian processes
to estimate the Bayesian posterior probability $P_B(f \mid S)$ that the DNN
expresses $f$ upon random sampling of its parameters, conditioned on $S$.
Our main findings are that $P_{SGD}(f \mid S)$ correlates remarkably well with
$P_B(f \mid S)$, and that $P_B(f \mid S)$ is strongly biased towards low-error and
low-complexity functions. These results imply that strong inductive bias in the
parameter-function map (which determines $P_B(f \mid S)$), rather than a special
property of SGD, is the primary explanation for why DNNs generalise so well in
the overparameterised regime.
While our results suggest that the Bayesian posterior $P_B(f \mid S)$ is the
first-order determinant of $P_{SGD}(f \mid S)$, there remain second-order
differences that are sensitive to hyperparameter tuning. A function-probability
picture, based on $P_{SGD}(f \mid S)$ and/or $P_B(f \mid S)$, can shed new light
on the way that variations in architecture or hyperparameter settings such as
batch size, learning rate, and optimiser choice affect DNN performance.
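As a rough sketch of how these two quantities can be defined (illustrative notation and assumptions, not quoted from the paper): $P_{SGD}(f \mid S)$ is the empirical frequency with which SGD training on $S$, started from random initialisations, converges to the function $f$ on a fixed set of inputs, while the zero-error Bayesian posterior is

  $P_B(f \mid S) = \dfrac{P(f)\, \mathbf{1}[f \text{ fits } S]}{\sum_{f'} P(f')\, \mathbf{1}[f' \text{ fits } S]},$

where $P(f)$ is the prior probability that the network expresses $f$ upon random sampling of its parameters, here estimated with Gaussian-process approximations.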
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
To solve a task with reinforcement learning (RL), it is necessary to formally
specify the goal of that task. Although most RL algorithms require that the
goal is formalised as a Markovian reward function, alternatives have been
developed (such as Linear Temporal Logic and Multi-Objective Reinforcement
Learning). Moreover, it is well known that some of these formalisms are able to
express certain tasks that other formalisms cannot express. However, there has
not yet been any thorough analysis of how these formalisms relate to each other
in terms of expressivity. In this work, we fill this gap in the existing
literature by providing a comprehensive comparison of the expressivities of 17
objective-specification formalisms in RL. We place these formalisms in a
preorder based on their expressive power, and present this preorder as a Hasse
diagram. We find a variety of limitations for the different formalisms, and
that no formalism is both dominantly expressive and straightforward to optimise
with current techniques. For example, we prove that each of Regularised RL,
Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and
Limit Average Rewards can express an objective that the others cannot. Our
findings have implications for both policy optimisation and reward learning.
Firstly, we identify expressivity limitations which are important to consider
when specifying objectives in practice. Secondly, our results highlight the
need for future research which adapts reward learning to work with a variety of
formalisms, since many existing reward learning methods implicitly assume that
desired objectives can be expressed with Markovian rewards. Our work
contributes towards a more cohesive understanding of the costs and benefits of
different RL objective-specification formalisms.
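For orientation, one of the formalisms compared above, Limit Average Rewards, evaluates a policy by its long-run average reward rather than a discounted sum; in standard notation (not quoted from the paper),

  $J(\pi) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} R(s_t, a_t)\Big].$

Objectives of this form are one illustration of how the formalisms above can differ in what they are able to express.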
Goodhart's Law in Reinforcement Learning
Implementing a reward function that perfectly captures a complex task in the
real world is impractical. As a result, it is often appropriate to think of the
reward function as a proxy for the true objective rather than as its
definition. We study this phenomenon through the lens of Goodhart's law, which
predicts that increasing optimisation of an imperfect proxy beyond some
critical point decreases performance on the true objective. First, we propose a
way to quantify the magnitude of this effect and show empirically that
optimising an imperfect proxy reward often leads to the behaviour predicted by
Goodhart's law for a wide range of environments and reward functions. We then
provide a geometric explanation for why Goodhart's law occurs in Markov
decision processes. We use these theoretical insights to propose an optimal
early stopping method that provably avoids the aforementioned pitfall and
derive theoretical regret bounds for this method. Moreover, we derive a
training method that maximises worst-case reward, for the setting where there
is uncertainty about the true reward function. Finally, we evaluate our early
stopping method experimentally. Our results provide a foundation for a
theoretically principled study of reinforcement learning under reward
misspecification.
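To make the effect concrete, in notation of my own choosing rather than the paper's: write $J_R(\pi)$ for the expected return of a policy $\pi$ under the true reward $R$, and $J_{\tilde{R}}(\pi)$ for its return under the proxy $\tilde{R}$. Goodhart's law, as studied here, is the pattern that along a sequence of policies with increasing proxy return $J_{\tilde{R}}(\pi_1) < J_{\tilde{R}}(\pi_2) < \cdots$, the true return $J_R(\pi_i)$ typically rises at first but falls once the proxy is optimised beyond some critical point; the early stopping method is designed to halt optimisation before that point is reached.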
Neural networks are a priori biased towards Boolean functions with low entropy
Understanding the inductive bias of neural networks is critical to explaining
their ability to generalise. Here, for one of the simplest neural networks -- a
single-layer perceptron with n input neurons, one output neuron, and no
threshold bias term -- we prove that upon random initialisation of weights, the
a priori probability P(t) that it represents a Boolean function that classifies
t points in {0,1}^n as 1 has a remarkably simple form: P(t) = 2^{-n} for 0 \leq t < 2^n.
Since a perceptron can express far fewer Boolean functions with small or
large values of t (low entropy) than with intermediate values of t (high
entropy), there is, on average, a strong intrinsic a priori bias towards
individual functions with low entropy. Furthermore, within a class of functions
with fixed t, we often observe a further intrinsic bias towards functions of
lower complexity. Finally, we prove that, regardless of the distribution of
inputs, the bias towards low entropy becomes monotonically stronger upon adding
ReLU layers, and empirically show that increasing the variance of the bias term
has a similar effect.
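The flat form of P(t) is easy to probe numerically. The sketch below is my own illustration rather than code from the paper, and assumes the perceptron outputs 1 exactly when the weighted sum of its inputs is strictly positive; under the result above, the estimated histogram of P(t) should be approximately flat at height 2^{-n} over 0 \leq t < 2^n.

import numpy as np
from itertools import product

# Estimate P(t) for a bias-free single-layer perceptron with n inputs by
# sampling random weight vectors and counting how many of the 2^n Boolean
# points each sampled perceptron classifies as 1.
n = 5
inputs = np.array(list(product([0, 1], repeat=n)), dtype=float)  # all 2^n points

rng = np.random.default_rng(0)
num_samples = 100_000
counts = np.zeros(2**n + 1, dtype=int)

for _ in range(num_samples):
    w = rng.standard_normal(n)            # random weights, no threshold bias term
    t = int((inputs @ w > 0).sum())       # number of points classified as 1
    counts[t] += 1

p_t = counts / num_samples                # empirical estimate of P(t)
print(p_t[:2**n])                         # predicted to be roughly 2^{-n} = 1/32 each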